Abstract

Many applications use sequences of n consecutive symbols (n-grams). Hashing these
n-grams can be a performance bottleneck. For more speed, recursive hash families
compute hash values by updating previous values. We prove that recursive hash
families cannot be more than pairwise independent. While hashing by irreducible
polynomials is pairwise independent, our implementations either run in time $O(n)$
or use an exponential amount of memory. As a more scalable alternative, we make
hashing by cyclic polynomials pairwise independent by ignoring $n-1$ bits.
Experimentally, we show that hashing by cyclic polynomials is twice as fast as
hashing by irreducible polynomials. We also show that randomized Karp-Rabin hash
families are not pairwise independent.

1. Introduction

An n-gram is a consecutive sequence of n symbols from an alphabet $\Sigma$. An n-gram
hash function h maps n-grams to numbers in $[0, 2^L)$. These functions have several
applications from full-text matching, pattern matching or language models [5, 11]
to plagiarism detection [12].

To prove that a hashing algorithm must work well, we typically need hash values 
to satisfy some statistical property. Indeed, a hash function that maps all n-grams to a 
single integer would not be useful. Yet, a single hash function is deterministic: it maps 
an n-gram to a single hash value. Thus, we may be able to choose the input data so that 
the hash values are biased. Therefore, we randomly pick a function from a family
$\mathcal{H}$ of functions [13].

Such a family $\mathcal{H}$ is uniform (over L bits) if all hash values are equiprobable.
That is, considering h selected uniformly at random from $\mathcal{H}$, we have
$P(h(x) = y) = 1/2^L$ for all n-grams x and all hash values y. This condition is weak;
the family of constant functions ($h(x) = c$) is uniform.

Intuitively, we would want that if an adversary knows the hash value of one n-gram,
it cannot deduce anything about the hash value of another n-gram. For example, with
the family of constant functions, once we know one hash value, we know them all.
The family $\mathcal{H}$ is pairwise independent if the hash value of n-gram $x_1$ is
independent from the hash value of any other n-gram $x_2$. That is, we have
$P(h(x_1) = y \land h(x_2) = z) = P(h(x_1) = y) P(h(x_2) = z) = 1/4^L$ for all n-grams
$x_1$, $x_2$ with $x_1 \neq x_2$, and all hash values y, z. Pairwise independence implies
uniformity. We refer to a particular hash function $h \in \mathcal{H}$ as "uniform" or
"a pairwise independent hash function" when the family in question can be inferred
from the context.

Moreover, the idea of pairwise independence can be generalized: a family of hash
functions $\mathcal{H}$ is k-wise independent if given distinct $x_1, \ldots, x_k$ and given h
selected uniformly at random from $\mathcal{H}$, then
$P(h(x_1) = y_1 \land \cdots \land h(x_k) = y_k) = 1/2^{kL}$. Note that k-wise independence
implies (k-1)-wise independence and uniformity. (Fully) independent families are
k-wise independent for arbitrarily large k. For applications, non-independent
families may fare as well as fully independent families if the entropy of the data
source is sufficiently high [16].

A hash function h is recursive [17] (or rolling [18]) if there is a function F
computing the hash value of the n-gram $x_2 \ldots x_{n+1}$ from the hash value of the
preceding n-gram ($x_1 \ldots x_n$) and the values of $x_1$ and $x_{n+1}$. That is, we have

\[ h(x_2, \ldots, x_{n+1}) = F(h(x_1, \ldots, x_n), x_1, x_{n+1}). \]

Ideally, we could compute function F in time $O(L)$ and not, for example, in time $O(Ln)$.
The main contributions of this paper are:

• a proof that recursive hashing is no more than pairwise independent (§3);

• a proof that randomized Karp-Rabin can be uniform but never pairwise independent (§5);

• a proof that hashing by irreducible polynomials is pairwise independent (§7);

• a proof that hashing by cyclic polynomials is not even uniform (§9);

• a proof that hashing by cyclic polynomials is pairwise independent, after ignoring
  $n-1$ consecutive bits (§10).

We conclude with an experimental section where we show that hashing by cyclic
polynomials is faster than hashing by irreducible polynomials. Table 1 summarizes the
algorithms presented.



2. Trailing-zero independence

Some randomized algorithms [14, 15] merely require that the number of trailing
zeroes be independent. For example, to estimate the number of distinct n-grams in
a large document without enumerating them, we merely have to compute maximal
numbers of leading zeroes k among hash values [19]. Naively, we may estimate that
if a hash value with k leading zeroes is found, we have $\approx 2^k$ distinct n-grams. Such
estimates might be useful because the number of distinct n-grams grows large with n:
Shakespeare's First Folio [20] has over 3 million distinct 15-grams.

Table 1: A summary of the hashing functions presented and their properties. For GENERAL
and CYCLIC, we require L > n. To make CYCLIC pairwise independent, we need to discard
some bits; the resulting scheme is not formally recursive. Randomized Karp-Rabin is
uniform under some conditions.

name                          cost per n-gram                    independence     memory use
non-recursive 3-wise (§4)     $O(Ln)$                            3-wise           $O(nL|\Sigma|)$
Randomized Karp-Rabin (§5)    $O(L \log L \, 2^{O(\log^* L)})$   uniform          $O(L|\Sigma|)$
GENERAL (§7)                  $O(Ln)$                            pairwise         $O(L|\Sigma|)$
RAM-Buffered GENERAL (§8)     $O(L)$                             pairwise         $O(L|\Sigma| + L2^n)$
CYCLIC (§9)                   $O(L + n)$                         pairwise (§10)   $O((L+n)|\Sigma|)$

Formally, let zeros(x) return the number of trailing zeros (0, 1, ..., L) of x, where
zeros(0) = L. We say h is k-wise trailing-zero independent if
$P(\mathrm{zeros}(h(x_1)) \geq j_1 \land \mathrm{zeros}(h(x_2)) \geq j_2 \land \cdots \land \mathrm{zeros}(h(x_k)) \geq j_k) = 2^{-j_1 - j_2 - \cdots - j_k}$
for $j_i = 0, 1, \ldots, L$.

If h is k-wise independent, it is k-wise trailing-zero independent. The converse is
not true. If h is a k-wise independent function, consider $g \circ h$ where g makes zero all
bits before the rightmost 1 (e.g., g(0101100) = 0000100). Hash $g \circ h$ is k-wise
trailing-zero independent but not even uniform (consider that $P(g = 0001) = 8 P(g = 1000)$).
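
The definitions above can be made concrete with a short Python sketch. The function names (`zeros`, `g`, `naive_distinct_estimate`) are ours, chosen for illustration, and the estimator is only the naive $2^k$ rule mentioned earlier, applied here to trailing zeros; it is not a tuned cardinality estimator.

```python
def zeros(x: int, L: int) -> int:
    """Number of trailing zero bits of the L-bit integer x, with zeros(0) = L."""
    if x == 0:
        return L
    # x & -x isolates the rightmost 1 bit; its position is the trailing-zero count.
    return (x & -x).bit_length() - 1


def g(x: int) -> int:
    """Zero all bits before the rightmost 1, e.g. g(0b0101100) = 0b0000100."""
    return x & -x


def naive_distinct_estimate(hash_values, L: int) -> int:
    """Naive estimate 2**k, where k is the largest trailing-zero count observed."""
    k = max(zeros(v, L) for v in hash_values)
    return 2 ** k


# With 4-bit inputs taken uniformly, g returns 0001 eight times as often as 1000.
count_0001 = sum(1 for x in range(16) if g(x) == 0b0001)
count_1000 = sum(1 for x in range(16) if g(x) == 0b1000)
```

Because `g` preserves the number of trailing zeros, composing it with a k-wise independent hash keeps trailing-zero independence while destroying uniformity, which is the point of the counterexample.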



3. Recursive hash functions are no more than pairwise independent

Not only are recursive hash functions limited to pairwise independence: they cannot
be 3-wise trailing-zero independent.

Proposition 1 There is no 3-wise trailing-zero independent hashing function that is
recursive.

Proof Consider the (n+2)-gram $a^n b b$. Suppose h is recursive and 3-wise
trailing-zero independent, then

\[
\begin{aligned}
& P\big(\mathrm{zeros}(h(a,\ldots,a)) \geq L \land \mathrm{zeros}(h(a,\ldots,a,b)) \geq L \land \mathrm{zeros}(h(a,\ldots,a,b,b)) \geq L\big) \\
&= P\big(h(a,\ldots,a) = 0 \land F(0,a,b) = 0 \land F(0,a,b) = 0\big) \\
&= P\big(h(a,\ldots,a) = 0 \land F(0,a,b) = 0\big) \\
&= P\big(\mathrm{zeros}(h(a,\ldots,a)) \geq L \land \mathrm{zeros}(h(a,\ldots,a,b)) \geq L\big) \\
&= 2^{-2L} \quad \text{by trailing-zero pairwise independence} \\
&\neq 2^{-3L} \quad \text{as required by trailing-zero 3-wise independence.}
\end{aligned}
\]

Hence, we have a contradiction and no such h exists.



4. A non-recursive 3-wise independent hash function

A trivial way to generate an independent hash is to assign a random integer in $[0, 2^L)$
to each new value x. Unfortunately, this requires as much processing and storage as a
complete indexing of all values.

However, in a multidimensional setting this approach can be put to good use. Suppose
that we have tuples in $K_1 \times K_2 \times \cdots \times K_n$ such that $|K_i|$ is small for all i. We
can construct independent hash functions $h_i : K_i \to [0, 2^L)$ for all i and combine them.
The hash function $h(x_1, x_2, \ldots, x_n) = h_1(x_1) \oplus h_2(x_2) \oplus \cdots \oplus h_n(x_n)$ is then
3-wise independent ($\oplus$ is the "exclusive or" function, XOR). In time
$O(\sum_{i=1}^n |K_i|)$, we can construct the hash function by generating $\sum_{i=1}^n |K_i|$
random numbers and storing them in a look-up table. With constant-time look-up, hashing
an n-gram thus takes $O(Ln)$ time. Algorithm 1 is an application of this idea to n-grams.

Algorithm 1 The (non-recursive) 3-wise independent family.

Require: n L-bit hash functions $h_1, h_2, \ldots, h_n$ over $\Sigma$ from an independent hash family

s ← empty FIFO structure
for each character c do
    append c to s
    if length(s) = n then
        yield $h_1(s_1) \oplus h_2(s_2) \oplus \cdots \oplus h_n(s_n)$
        {The yield statement returns the value, without terminating the algorithm.}
        remove oldest character from s
    end if
end for



This new family is not 4-wise independent for n > 1. Consider the n-grams ac, ad,
bc, bd. The XOR of their four hash values is zero. However, the family is 3-wise
independent.
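
As an illustration of Algorithm 1 and of the 4-wise counterexample, here is a minimal Python sketch. The helper names (`make_char_hashes`, `three_wise_hashes`) are ours, and Python's pseudo-random generator merely stands in for the independent character hash family that the analysis assumes.

```python
import random
from collections import deque


def make_char_hashes(n, alphabet, L, seed=0):
    """One random table per position: tables[i][c] is an L-bit value."""
    rng = random.Random(seed)
    return [{c: rng.getrandbits(L) for c in alphabet} for _ in range(n)]


def three_wise_hashes(text, n, tables):
    """Yield h_1(s_1) xor ... xor h_n(s_n) for every n-gram s of text."""
    window = deque()
    for c in text:
        window.append(c)
        if len(window) == n:
            value = 0
            for table, symbol in zip(tables, window):
                value ^= table[symbol]      # n table look-ups per n-gram
            yield value
            window.popleft()


tables = make_char_hashes(n=2, alphabet="abcd", L=19, seed=42)
h = {w: next(three_wise_hashes(w, 2, tables)) for w in ("ac", "ad", "bc", "bd")}
# Each character hash appears exactly twice, so the four values cancel.
xor_of_four = h["ac"] ^ h["ad"] ^ h["bc"] ^ h["bd"]
```

Whatever the seed, `xor_of_four` is zero, so knowing three of the four hash values determines the fourth; this is why the family cannot be 4-wise independent.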

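Proposition 2 The family of hash functions $h(x) = h_1(x_1) \oplus h_2(x_2) \oplus \cdots \oplus h_n(x_n)$,
where the L-bit hash functions $h_1, \ldots, h_n$ are taken from an independent hash family, is
3-wise independent.

Proof Consider any 3 distinct n-grams: $x^{(1)} = x^{(1)}_1 \ldots x^{(1)}_n$,
$x^{(2)} = x^{(2)}_1 \ldots x^{(2)}_n$, and $x^{(3)} = x^{(3)}_1 \ldots x^{(3)}_n$. Because the
n-grams are distinct, at least one of two possibilities holds:

Case A For some $i \in \{1, \ldots, n\}$, the three values $x^{(1)}_i, x^{(2)}_i, x^{(3)}_i$ are
distinct. Write $\chi_j = h_i(x^{(j)}_i)$ for j = 1, 2, 3. For example, consider the three
1-grams: a, b, c.

Case B (Up to a reordering of the three n-grams.) There are two values
$i, j \in \{1, \ldots, n\}$ such that $x^{(1)}_i$ is distinct from the two identical values
$x^{(2)}_i, x^{(3)}_i$, and such that $x^{(2)}_j$ is distinct from the two identical values
$x^{(1)}_j, x^{(3)}_j$. Write $\chi_1 = h_i(x^{(1)}_i)$, $\chi_2 = h_j(x^{(2)}_j)$, and
$\chi_3 = h_i(x^{(3)}_i)$. For example, consider the three 2-grams: ad, bc, bd.

Recall that the XOR operation is invertible: $a \oplus b = c$ if and only if $a = b \oplus c$.
We prove 3-wise independence for cases A and B.

Case A. Write $f^{(i)} = h(x^{(i)}) \oplus \chi_i$ for i = 1, 2, 3. We have that the values
$\chi_1, \chi_2, \chi_3$ are mutually independent, and they are independent from the values
$f^{(1)}, f^{(2)}, f^{(3)}$:

\[
P\Big(\bigwedge_{i=1}^{3} \chi_i = y_i \land \bigwedge_{i=1}^{3} f^{(i)} = y'_i\Big)
= \prod_{i=1}^{3} P(\chi_i = y_i) \; P\Big(\bigwedge_{i=1}^{3} f^{(i)} = y'_i\Big)
\]

for all values $y_i, y'_i$. (The values $f^{(1)}, f^{(2)}, f^{(3)}$ are not necessarily mutually
independent.) Hence, we have

\[
\begin{aligned}
& P\big(h(x^{(1)}) = z^{(1)} \land h(x^{(2)}) = z^{(2)} \land h(x^{(3)}) = z^{(3)}\big) \\
&= P\big(\chi_1 = z^{(1)} \oplus f^{(1)} \land \chi_2 = z^{(2)} \oplus f^{(2)} \land \chi_3 = z^{(3)} \oplus f^{(3)}\big) \\
&= \sum_{\eta,\eta',\eta''} P\big(\chi_1 = z^{(1)} \oplus \eta \land \chi_2 = z^{(2)} \oplus \eta' \land \chi_3 = z^{(3)} \oplus \eta''\big)
   \times P\big(f^{(1)} = \eta \land f^{(2)} = \eta' \land f^{(3)} = \eta''\big) \\
&= \sum_{\eta,\eta',\eta''} \frac{1}{2^{3L}} P\big(f^{(1)} = \eta \land f^{(2)} = \eta' \land f^{(3)} = \eta''\big) \\
&= \frac{1}{2^{3L}}.
\end{aligned}
\]

Thus, in this case, the hash values are 3-wise independent.

Case B. Write $f^{(1)} = h(x^{(1)}) \oplus \chi_1$, $f^{(2)} = h(x^{(2)}) \oplus \chi_2 \oplus \chi_3$, and
$f^{(3)} = h(x^{(3)}) \oplus \chi_3$. Again, the values $\chi_1, \chi_2, \chi_3$ are mutually independent, and
independent from the values $f^{(1)}, f^{(2)}, f^{(3)}$. We have

\[
\begin{aligned}
& P\big(h(x^{(1)}) = z^{(1)} \land h(x^{(2)}) = z^{(2)} \land h(x^{(3)}) = z^{(3)}\big) \\
&= P\big(\chi_1 = z^{(1)} \oplus f^{(1)} \land \chi_2 \oplus \chi_3 = z^{(2)} \oplus f^{(2)} \land \chi_3 = z^{(3)} \oplus f^{(3)}\big) \\
&= P\big(\chi_1 = z^{(1)} \oplus f^{(1)} \land \chi_2 = z^{(2)} \oplus f^{(2)} \oplus z^{(3)} \oplus f^{(3)} \land \chi_3 = z^{(3)} \oplus f^{(3)}\big) \\
&= \sum_{\eta,\eta',\eta''} P\big(\chi_1 = z^{(1)} \oplus \eta \land \chi_2 = z^{(2)} \oplus \eta' \oplus z^{(3)} \oplus \eta'' \land \chi_3 = z^{(3)} \oplus \eta''\big)
   \times P\big(f^{(1)} = \eta \land f^{(2)} = \eta' \land f^{(3)} = \eta''\big) \\
&= \sum_{\eta,\eta',\eta''} \frac{1}{2^{3L}} P\big(f^{(1)} = \eta \land f^{(2)} = \eta' \land f^{(3)} = \eta''\big) \\
&= \frac{1}{2^{3L}}.
\end{aligned}
\]

This concludes the proof.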



5. Randomized Karp-Rabin is not independent

One of the most common recursive hash functions is commonly associated with the
Karp-Rabin string-matching algorithm [21]. Given an integer B, the hash value over
the sequence of integers $x_1, x_2, \ldots, x_n$ is $\sum_{i=1}^{n} x_i B^{n-i}$. A variation of the
Karp-Rabin hash method is "Hashing by Power-of-2 Integer Division" [17], where
$h(x_1, \ldots, x_n) = \sum_{i=1}^{n} x_i B^{n-i} \bmod 2^L$. In particular, the hashcode method of the
Java String class uses this approach, with L = 32 and B = 31 [22]. A widely used
textbook [23, p. 157] recommends a similar Integer-Division hash function for strings
with B = 37.

Since such Integer-Division hash functions are recursive, quickly computed, and
widely used, it is interesting to seek a randomized version of them. Assume that $h_1$
is a random hash function over symbols uniform in $[0, 2^L)$, then define
$h(x_1, \ldots, x_n) = B^{n-1} h_1(x_1) + B^{n-2} h_1(x_2) + \cdots + h_1(x_n) \bmod 2^L$ for some fixed
integer B. We choose B = 37 (calling the resulting randomized hash "ID37"; see
Algorithm 2). Our algorithm computes each hash value in time $O(M(L))$, where $M(L)$ is
the cost of multiplying two L-bit integers. (We precompute the value $B^n \bmod 2^L$.) In
many practical cases, L bits can fit into a single machine word and the cost of
multiplication can be considered constant. In general, $M(L)$ is in
$O(L \log L \, 2^{O(\log^* L)})$ [24].



Algorithm 2 The recursive ID37 family (Randomized Karp-Rabin).

Require: an L-bit hash function $h_1$ over $\Sigma$ from an independent hash family

B ← 37
s ← empty FIFO structure
x ← 0 (L-bit integer)
z ← 0 (L-bit integer)
for each character c do
    append c to s
    x ← $Bx - B^n z + h_1(c) \bmod 2^L$
    if length(s) = n then
        yield x
        remove oldest character y from s
        z ← $h_1(y)$
    end if
end for
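
Algorithm 2 can be sketched in Python as follows. This is an illustrative transcription under our own naming (`id37_hashes`, `id37_direct`), with a seeded pseudo-random table standing in for the independent character hash $h_1$; it is not the C++ code used in the experiments.

```python
import random
from collections import deque


def id37_hashes(text, n, h1, L, B=37):
    """Yield the randomized Karp-Rabin (ID37) hash of every n-gram of text."""
    mask = (1 << L) - 1
    Bn = pow(B, n, 1 << L)      # precomputed B**n mod 2**L
    window = deque()
    x = 0                       # current hash value
    z = 0                       # h1 of the character that last left the window
    for c in text:
        window.append(c)
        x = (B * x - Bn * z + h1[c]) & mask   # & mask reduces modulo 2**L
        if len(window) == n:
            yield x
            z = h1[window.popleft()]


def id37_direct(ngram, h1, L, B=37):
    """Non-recursive definition: sum of B**(n-i) * h1(x_i) modulo 2**L."""
    x = 0
    for c in ngram:
        x = (B * x + h1[c]) % (1 << L)
    return x


rng = random.Random(1)
h1 = {c: rng.getrandbits(19) for c in "abcdefghijklmnopqrstuvwxyz "}
text = "the quick brown fox"
rolled = list(id37_hashes(text, 3, h1, 19))
direct = [id37_direct(text[i:i + 3], h1, 19) for i in range(len(text) - 2)]
```

The lists `rolled` and `direct` agree, which checks that the recursive update matches the definition while touching only one new character per n-gram.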



The randomized Integer-Division functions mapping n-grams to $[0, 2^L)$ are not
pairwise independent. However, they are uniform except when B is odd and n is even.

Proposition 3 Randomized Integer-Division hashing with B odd is not uniform for
n-grams, if n is even. Otherwise, it is uniform, but not pairwise independent.

Proof For B odd, we see that $P(h(a^{2n}) = 0) > 2^{-L}$ since
$h(a^{2n}) = h_1(a)\big(B^0(1+B) + B^2(1+B) + \cdots + B^{2n-2}(1+B)\big) \bmod 2^L$ and since
$(1+B)$ is even, we have
$P(h(a^{2n}) = 0) \geq P(h_1(a) = 2^{L-1} \lor h_1(a) = 0) = 1/2^{L-1}$. Hence, for B odd and n
even, we do not have uniformity.

For the rest of the result, we begin with n = 2 and B even. If $x_1 \neq x_2$, then
$P(h(x_1, x_2) = y) = P(B h_1(x_1) + h_1(x_2) = y \bmod 2^L)
= \sum_z P(h_1(x_2) = y - Bz \bmod 2^L) P(h_1(x_1) = z)
= \sum_z P(h_1(x_2) = y - Bz \bmod 2^L)/2^L = 1/2^L$, whereas
$P(h(x_1, x_1) = y) = P((B+1) h_1(x_1) = y \bmod 2^L) = 1/2^L$ since
$(B+1)x = y \bmod 2^L$ has a unique solution x when B is even. Therefore h is uniform.
This argument can be extended to any value of n when B is even, and to n odd when B
is odd.

To show it is not pairwise independent, first suppose that B is odd. For any string
$\beta$ of length $n-2$, consider n-grams $w_1 = \beta aa$ and $w_2 = \beta bb$ for distinct
$a, b \in \Sigma$. Then
$P(h(w_1) = h(w_2)) = P(B^2 h(\beta) + B h_1(a) + h_1(a) = B^2 h(\beta) + B h_1(b) + h_1(b) \bmod 2^L)
= P((1+B)(h_1(a) - h_1(b)) \bmod 2^L = 0)
\geq P(h_1(a) - h_1(b) = 0) + P(h_1(a) - h_1(b) = 2^{L-1})$.
Because $h_1$ is independent,
$P(h_1(a) - h_1(b) = 0) = \sum_{c \in [0,2^L)} P(h_1(a) = c) P(h_1(b) = c) = \sum_{c \in [0,2^L)} 1/4^L = 1/2^L$.
Moreover, $P(h_1(a) - h_1(b) = 2^{L-1}) > 0$. Thus, we have that
$P(h(w_1) = h(w_2)) > 1/2^L$, which contradicts pairwise independence. Second, if B is
even, a similar argument shows $P(h(w_3) = h(w_4)) > 1/2^L$, where $w_3 = \beta aa$ and
$w_4 = \beta ba$:
$P(h(a,a) = h(b,a)) = P(B h_1(a) + h_1(a) = B h_1(b) + h_1(a) \bmod 2^L)
= P(B(h_1(a) - h_1(b)) \bmod 2^L = 0)
\geq P(h_1(a) - h_1(b) = 0) + P(h_1(a) - h_1(b) = 2^{L-1}) > 1/2^L$.
This argument can be extended for any value of B and n.

A weaker condition than pairwise independence is 2-universality: a family is
2-universal if $P(h(x_1) = h(x_2)) \leq 1/2^L$ [16]. As a consequence of this proof,
Randomized Integer-Division is not even 2-universal.

These results also hold for any Integer-Division hash where the modulo is by an
even number, not necessarily a power of 2.
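
For small parameters, the probabilities in the proof can be computed exhaustively. The sketch below uses our own choice of L = 4, n = 2 and B = 37 (so B is odd and n is even), and enumerates every value of the character hashes instead of sampling them.

```python
from fractions import Fraction
from itertools import product

L, B = 4, 37
M = 1 << L


def h2(u, v):
    """Integer-Division hash of a 2-gram whose characters hash to u and v."""
    return (B * u + v) % M


# P(h(a,a) = 0) when h1(a) is uniform over [0, 2**L).
p_zero = Fraction(sum(h2(v, v) == 0 for v in range(M)), M)

# P(h(a,a) = h(b,b)) when h1(a) and h1(b) are independent and uniform.
p_collision = Fraction(
    sum(h2(u, u) == h2(v, v) for u, v in product(range(M), repeat=2)),
    M * M,
)
```

Both probabilities come out to 1/8, twice the value 1/16 that uniformity and 2-universality would allow for L = 4, which matches the bound $1/2^{L-1}$ in the proof.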

6. Generating hash families from polynomials over Galois fields

A practical form of hashing using the binary Galois field GF(2) is called "Recursive
Hashing by Polynomials" and has been attributed to Kubina by Cohen [17]. GF(2)
contains only two values (1 and 0) with the addition (and hence subtraction) defined by
XOR, $a + b = a \oplus b$, and the multiplication by AND, $a \times b = a \land b$. GF(2)[x] is the
vector space of all polynomials with coefficients from GF(2). Any integer in binary
form (e.g., c = 1101) can thus be interpreted as an element of GF(2)[x] (e.g.,
$c = x^3 + x^2 + 1$). If $p(x) \in$ GF(2)[x], then GF(2)[x]/p(x) can be thought of as GF(2)[x]
modulo p(x). As an example, if $p(x) = x^2$, then GF(2)[x]/p(x) is the set of all linear
polynomials. For instance, $x^3 + x^2 + x + 1 = x + 1 \bmod x^2$ since, in GF(2)[x],
$(x+1) + x^2(x+1) = x^3 + x^2 + x + 1$.

As a summary, we compute operations over GF(2)[x]/p(x), where p(x) is of degree L,
as follows:

• the polynomial $\sum_{i=0}^{L-1} q_i x^i$ is represented as the L-bit integer
  $\sum_{i=0}^{L-1} q_i 2^i$;

• subtraction or addition of two polynomials is the XOR of their L-bit integers;

• multiplication of a polynomial $\sum_{i=0}^{L-1} q_i x^i$ by the monomial x is represented
  either as $\sum_{i=0}^{L-1} q_i x^{i+1}$ if $q_{L-1} = 0$ or as $p(x) + \sum_{i=0}^{L-1} q_i x^{i+1}$
  otherwise. In other words, if the value of the last bit ($q_{L-1}$) is 0, we merely apply
  a binary left shift; otherwise, we apply a binary left shift immediately followed by
  an XOR with the integer representing p(x). In either case, we get an L-bit integer.

Hence, merely with the XOR operation, the binary left shift, and a way to evaluate the
value of the last bit, we can compute all necessary operations over GF(2)[x]/p(x) using
integers.

Table 2: Some irreducible polynomials over GF(2)[x]
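
The three rules above translate directly into integer operations. The following Python sketch is illustrative only: the names `gf2_shift` and `gf2_mul` are ours, and the example modulus $x^3 + x + 1$ (a well-known irreducible polynomial of degree 3) is chosen for brevity.

```python
def gf2_shift(q: int, p: int, L: int) -> int:
    """Multiply the polynomial q by x in GF(2)[x]/p(x).

    q is an L-bit integer; p is the (L+1)-bit integer encoding p(x),
    whose bit L is set because p(x) has degree L.
    """
    q <<= 1                 # multiplication by x is a binary left shift
    if (q >> L) & 1:        # the coefficient of x**L is now 1
        q ^= p              # XOR with p(x) brings the result back to L bits
    return q


def gf2_mul(a: int, b: int, p: int, L: int) -> int:
    """Multiply a by b in GF(2)[x]/p(x) using only shifts and XORs."""
    result = 0
    while b:
        if b & 1:
            result ^= a     # addition of polynomials is XOR
        a = gf2_shift(a, p, L)
        b >>= 1
    return result


P3 = 0b1011                                  # x^3 + x + 1
x_cubed = gf2_shift(0b100, P3, 3)            # x * x^2 = x^3 = x + 1
has_inverse = all(
    any(gf2_mul(a, b, P3, 3) == 1 for b in range(1, 8)) for a in range(1, 8)
)
```

With this modulus every non-zero element has a multiplicative inverse, which is the field property that an irreducible p(x) provides.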

Consider a hash function $h_1$ over characters taken from some independent family.
Interpreting $h_1$ hash values as polynomials in GF(2)[x]/p(x), and with the condition
that degree(p(x)) > n, we define a hash function as
$h(a_1, a_2, \ldots, a_n) = h_1(a_1) x^{n-1} + h_1(a_2) x^{n-2} + \cdots + h_1(a_n)$. It is recursive over
the sequence $h_1(a_i)$. The combined hash can be computed by reusing previous hash values:

\[ h(a_2, a_3, \ldots, a_{n+1}) = x\, h(a_1, a_2, \ldots, a_n) - h_1(a_1) x^n + h_1(a_{n+1}). \]

Depending on the choice of the polynomial p(x) we get different hashing schemes,
including GENERAL and CYCLIC, which are presented in the next two sections.

7. Recursive hashing by irreducible polynomials is pairwise independent

We can choose p(x) to be an irreducible polynomial of degree L in GF(2)[x]: an
irreducible polynomial cannot be factored into nontrivial polynomials (see Table 2).
The resulting hash is called GENERAL (see Algorithm 3). The main benefit of setting
p(x) to be an irreducible polynomial is that GF(2)[x]/p(x) is a field; in particular, it
is impossible that $p_1(x) p_2(x) = 0 \bmod p(x)$ unless either $p_1(x) = 0$ or $p_2(x) = 0$. The
field property allows us to prove that the hash function is pairwise independent.

Algorithm 3 The recursive GENERAL family.

Require: an L-bit hash function $h_1$ over $\Sigma$ from an independent hash family; an
irreducible polynomial p of degree L in GF(2)[x]

s ← empty FIFO structure
x ← 0 (L-bit integer)
z ← 0 (L-bit integer)
for each character c do
    append c to s
    x ← shift(x)
    z ← shift$^n$(z)
    x ← $x \oplus z \oplus h_1(c)$
    if length(s) = n then
        yield x
        remove oldest character y from s
        z ← $h_1(y)$
    end if
end for

function shift
    input L-bit integer x
    shift x left by 1 bit, storing result in an (L+1)-bit integer x'
    if leftmost bit of x' is 1 then
        x' ← $x' \oplus p$
    end if
    {leftmost bit of x' is thus always 0}
    return rightmost L bits of x'

Lemma 1 GENERAL is pairwise independent.

Proof If p(x) is irreducible, then any non-zero $q(x) \in$ GF(2)[x]/p(x) has an inverse,
noted $q^{-1}(x)$, since GF(2)[x]/p(x) is a field. Interpret hash values as polynomials in
GF(2)[x]/p(x).

Firstly, we prove that GENERAL is uniform. In fact, we show a stronger result:
$P(q_1(x) h_1(a_1) + q_2(x) h_1(a_2) + \cdots + q_n(x) h_1(a_n) = y) = 1/2^L$ for any polynomials $q_i$
where at least one is different from zero. The result follows by induction
on the number of non-zero polynomials: it is clearly true where there is a single
non-zero polynomial $q_i(x)$, since
$q_i(x) h_1(a_i) = y \iff q_i^{-1}(x) q_i(x) h_1(a_i) = q_i^{-1}(x) y$.
Suppose it is true up to $k-1$ non-zero polynomials and consider a case where we
have k non-zero polynomials. Assume without loss of generality that $q_1(x) \neq 0$, we have
$P(q_1(x) h_1(a_1) + q_2(x) h_1(a_2) + \cdots + q_n(x) h_1(a_n) = y)
= P\big(h_1(a_1) = q_1^{-1}(x)(y - q_2(x) h_1(a_2) - \cdots - q_n(x) h_1(a_n))\big)
= \sum_{y'} P\big(h_1(a_1) = q_1^{-1}(x)(y - y')\big) P\big(q_2(x) h_1(a_2) + \cdots + q_n(x) h_1(a_n) = y'\big)
= \sum_{y'} \frac{1}{2^L} \frac{1}{2^L} = \frac{1}{2^L}$
by the induction argument. Hence the uniformity result is shown.

Consider two distinct sequences $a_1, a_2, \ldots, a_n$ and $a'_1, a'_2, \ldots, a'_n$. Write
$H_a = h(a_1, a_2, \ldots, a_n)$ and $H_{a'} = h(a'_1, a'_2, \ldots, a'_n)$. We have that
$P(H_a = y \land H_{a'} = y') = P(H_a = y \mid H_{a'} = y') P(H_{a'} = y')$. Hence, to prove pairwise
independence, it suffices to show that $P(H_a = y \mid H_{a'} = y') = 1/2^L$.

Suppose that $a_i = a'_j$ for some i, j; if not, the result follows since by the (full)
independence of the hashing function $h_1$, the values $H_a$ and $H_{a'}$ are independent. Write
$q(x) = -\big(\sum_{k \mid a_k = a_i} x^{n-k}\big)\big(\sum_{k \mid a'_k = a'_j} x^{n-k}\big)^{-1}$, then
$H_a + q(x) H_{a'}$ is independent from $a_i = a'_j$ (and $h_1(a_i) = h_1(a'_j)$).

In $H_a + q(x) H_{a'}$, only hashed values $h_1(a_k)$ for $a_k \neq a_i$ and $h_1(a'_k)$ for
$a'_k \neq a'_j$ remain: label them $h_1(b_1), \ldots, h_1(b_m)$. The result of the substitution can
be written $H_a + q(x) H_{a'} = \sum_k q_k(x) h_1(b_k)$ where $q_k(x)$ are polynomials in
GF(2)[x]/p(x). All $q_k(x)$ are zero if and only if $H_a + q(x) H_{a'} = 0$ for all values of
$h_1(a_1), \ldots, h_1(a_n)$ and $h_1(a'_1), \ldots, h_1(a'_n)$ (but notice that the value
$h_1(a_i) = h_1(a'_j)$ is irrelevant); in particular, it must be true when $h_1(a_k) = 1$ and
$h_1(a'_k) = 1$ for all k, hence
$(x^{n-1} + \cdots + x + 1) + q(x)(x^{n-1} + \cdots + x + 1) = 0 \Rightarrow q(x) = -1$. Thus, all $q_k(x)$
are zero if and only if $H_a = H_{a'}$ for all values of $h_1(a_1), \ldots, h_1(a_n)$ and
$h_1(a'_1), \ldots, h_1(a'_n)$, which only happens if the sequences a and a' are identical.
Hence, not all $q_k(x)$ are zero.

Write $H_{y',a'} = \big(\sum_{k \mid a'_k = a'_j} x^{n-k}\big)^{-1}\big(y' - \sum_{k \mid a'_k \neq a'_j} x^{n-k} h_1(a'_k)\big)$.
On the one hand, the condition $H_{a'} = y'$ can be rewritten as $h_1(a'_j) = H_{y',a'}$. On the
other hand, $H_a + q(x) H_{a'} = y + q(x) y'$ is independent from $h_1(a'_j) = H_{y',a'}$. Because
$P(h_1(a'_j) = H_{y',a'}) = 1/2^L$ irrespective of y' and $h_1(a'_k)$ for
$k \in \{k \mid a'_k \neq a'_j\}$, then
$P(h_1(a'_j) = H_{y',a'} \mid H_a + q(x) H_{a'} = y + q(x) y') = P(h_1(a'_j) = H_{y',a'})$, which implies
that $h_1(a'_j) = H_{y',a'}$ and $H_a + q(x) H_{a'} = y + q(x) y'$ are independent. Hence, we have

\[
\begin{aligned}
P(H_a = y \mid H_{a'} = y')
&= P\big(H_a + q(x) H_{a'} = y + q(x) y' \mid h_1(a'_j) = H_{y',a'}\big) \\
&= P\big(H_a + q(x) H_{a'} = y + q(x) y'\big) \\
&= P\Big(\sum_k q_k(x) h_1(b_k) = y + q(x) y'\Big)
\end{aligned}
\]

and by the earlier uniformity result, this last probability is equal to $1/2^L$. This
concludes the proof.
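
Algorithm 3 can be sketched in Python as follows. The function names are ours, and the example uses the small irreducible polynomial $x^3 + x + 1$ with hand-picked character hashes purely for illustration; a real deployment would draw $h_1$ at random and use a larger degree.

```python
from collections import deque


def shift(x, p, L):
    """Multiply x by the monomial x in GF(2)[x]/p(x), where p has degree L."""
    x <<= 1
    if (x >> L) & 1:
        x ^= p
    return x


def general_hashes(text, n, h1, p, L):
    """Yield the GENERAL hash of every n-gram of text, following Algorithm 3."""
    window = deque()
    x = 0
    z = 0
    for c in text:
        window.append(c)
        x = shift(x, p, L)
        for _ in range(n):          # shift^n(z): this loop costs O(nL) time
            z = shift(z, p, L)
        x ^= z ^ h1[c]              # subtraction and addition are both XOR
        if len(window) == n:
            yield x
            z = h1[window.popleft()]


def general_direct(ngram, h1, p, L):
    """h(a_1..a_n) = h1(a_1) x^(n-1) + ... + h1(a_n) in GF(2)[x]/p(x)."""
    x = 0
    for c in ngram:
        x = shift(x, p, L) ^ h1[c]
    return x


h1 = {"a": 0b101, "b": 0b011, "c": 0b110}
text = "abcab"
rolled = list(general_hashes(text, 2, h1, 0b1011, 3))
direct = [general_direct(text[i:i + 2], h1, 0b1011, 3) for i in range(len(text) - 1)]
```

The inner loop that shifts `z` n times is the $O(nL)$ cost that motivates the RAM-buffered variant.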

8. Trading memory for speed: RAM-Buffered General

Unfortunately, GENERAL (as computed by Algorithm 3) requires $O(nL)$ time per
n-gram. Indeed, shifting a value n times in GF(2)[x]/p(x) requires $O(nL)$ time. However,
if we are willing to trade memory usage for speed, we can precompute these
shifts. We call the resulting scheme RAM-Buffered GENERAL.

Lemma 2 Pick any p(x) in GF(2)[x]. The degree of p(x) is L. Represent elements of
GF(2)[x]/p(x) as polynomials of degree at most $L-1$. Given any h in GF(2)[x]/p(x),
we can compute $x^n h$ in $O(L)$ time given an $O(L 2^n)$-bit memory buffer.

Proof Write h as $\sum_{i=0}^{L-1} q_i x^i$. Divide h into two parts,
$h^{(1)} = \sum_{i=0}^{L-n-1} q_i x^i$ and $h^{(2)} = \sum_{i=L-n}^{L-1} q_i x^i$, so that
$h = h^{(1)} + h^{(2)}$. Then $x^n h = x^n h^{(1)} + x^n h^{(2)}$. The first part, $x^n h^{(1)}$, is a
polynomial of degree at most $L-1$ since the degree of $h^{(1)}$ is at most $L-1-n$. Hence,
$x^n h^{(1)}$ as an L-bit value is just $q_{L-n-1} q_{L-n-2} \ldots q_0 0 \ldots 0$, which can be
computed in time $O(L)$. So, only the computation of $x^n h^{(2)}$ is possibly more expensive
than $O(L)$ time, but $h^{(2)}$ has only n terms as a polynomial (since the first $L-n$ terms
are always zero). Hence, if we precompute $x^n h^{(2)}$ for all $2^n$ possible values of
$h^{(2)}$, and store them in an array with $O(L)$ time look-ups, we can compute $x^n h$ as an
L-bit value in $O(L)$ time.

When n is large, this precomputation requires excessive space and precomputation
time. Fortunately, we can trade back some speed for memory. Consider the proof of
Lemma 2. Instead of precomputing the shifts of all $2^n$ possible values of $h^{(2)}$ using
an array of $2^n$ entries, we can further divide $h^{(2)}$ into K parts. For simplicity,
assume that the integer K divides n. The K parts $h^{(2,1)}, \ldots, h^{(2,K)}$ are made of the
first n/K bits, the next n/K bits and so on. Because
$x^n h^{(2)} = \sum_{i=1}^{K} x^n h^{(2,i)}$, we can shift $h^{(2)}$ by n in $O(KL)$ operations using K
arrays of $2^{n/K}$ entries. To summarize, we have a time complexity of $O(KL)$ per n-gram
using $O(L|\Sigma| + LK2^{n/K})$ bits. We implemented the case K = 2.
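
The single-table case (K = 1) of Lemma 2 can be sketched in Python as follows. The function names are ours, and the degree-8 modulus 0x11B ($x^8 + x^4 + x^3 + x + 1$) is used only as a convenient example; the lemma itself does not require p(x) to be irreducible.

```python
def build_shift_table(n, p, L):
    """table[t] = x^n * (t * x^(L-n)) mod p(x), for every n-bit value t."""
    table = []
    for t in range(1 << n):
        v = t << (L - n)            # t placed in the top n coefficients: h^(2)
        for _ in range(n):          # n single shifts, paid once per table entry
            v <<= 1
            if (v >> L) & 1:
                v ^= p
        table.append(v)
    return table


def shift_n_buffered(h, n, L, table):
    """Compute x^n * h mod p(x) with one table look-up."""
    low = h & ((1 << (L - n)) - 1)  # h^(1): degree at most L - n - 1
    high = h >> (L - n)             # h^(2): the top n coefficients
    # x^n * h^(1) needs no reduction, so it is a plain left shift.
    return (low << n) ^ table[high]


table = build_shift_table(3, 0x11B, 8)
```

The table holds $2^n$ entries of L bits each, which is the $O(L2^n)$ memory cost stated in the lemma; splitting `high` into K pieces would replace it with K smaller tables.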

9. Recursive hashing by cyclic polynomials is not even uniform

Choosing $p(x) = x^L + 1$ for L > n, for any polynomial $q(x) = \sum_{i=0}^{L-1} q_i x^i$, we have

\[ x^i q(x) = x^i (q_{L-1} x^{L-1} + \cdots + q_1 x + q_0) = q_{L-i-1} x^{L-1} + \cdots + q_{L-i+1} x + q_{L-i}. \]

Thus, we have that multiplication by $x^i$ is a bitwise rotation, a cyclic left shift,
which can be computed in $O(L)$ time. The resulting hash (see Algorithm 4) is called CYCLIC.
It requires only $O(L)$ time per hash value. Empirically, Cohen showed that CYCLIC is
uniform [17]. In contrast, we show that it is not formally uniform:



Algorithm 4 The recursive CYCLIC family.

Require: an L-bit hash function $h_1$ over $\Sigma$ from an independent hash family

s ← empty FIFO structure
x ← 0 (L-bit integer)
z ← 0 (L-bit integer)
for each character c do
    append c to s
    rotate x left by 1 bit
    rotate z left by n bits
    x ← $x \oplus z \oplus h_1(c)$
    if length(s) = n then
        yield x
        remove oldest character y from s
        z ← $h_1(y)$
    end if
end for
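
Algorithm 4 can be sketched in Python as follows. The names `rotl`, `cyclic_hashes` and `cyclic_direct` are ours, and a seeded pseudo-random table stands in for the independent character hash $h_1$.

```python
import random
from collections import deque


def rotl(v, r, L):
    """Rotate the L-bit integer v left by r bits (multiplication by x^r)."""
    r %= L
    return ((v << r) | (v >> (L - r))) & ((1 << L) - 1)


def cyclic_hashes(text, n, h1, L):
    """Yield the CYCLIC hash of every n-gram of text, following Algorithm 4."""
    window = deque()
    x = 0
    z = 0
    for c in text:
        window.append(c)
        x = rotl(x, 1, L)
        z = rotl(z, n, L)
        x ^= z ^ h1[c]
        if len(window) == n:
            yield x
            z = h1[window.popleft()]


def cyclic_direct(ngram, h1, L):
    """h(a_1..a_n) = h1(a_1) x^(n-1) + ... + h1(a_n) modulo x^L + 1."""
    x = 0
    for c in ngram:
        x = rotl(x, 1, L) ^ h1[c]
    return x


rng = random.Random(7)
h1 = {c: rng.getrandbits(8) for c in "abc"}
text = "abcabcca"
rolled = list(cyclic_hashes(text, 3, h1, 8))
direct = [cyclic_direct(text[i:i + 3], h1, 8) for i in range(len(text) - 2)]
```

Unlike GENERAL, the shift by n positions is a single rotation, so the cost per n-gram does not grow with an inner loop.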



Lemma 3 CYCLIC is not uniform for n even and never 2-universal, and thus never
pairwise independent.

Proof If n is even, use the fact that $x^{n-1} + \cdots + x + 1$ is divisible by $x + 1$ to
write $x^{n-1} + \cdots + x + 1 = (x+1) r(x)$ for some polynomial r(x). Clearly,
$r(x)(x+1)(x^{L-1} + x^{L-2} + \cdots + x + 1) = 0 \bmod x^L + 1$ for any r(x) and so
$P(h(a_1, a_1, \ldots, a_1) = 0) = P((x^{n-1} + \cdots + x + 1) h_1(a_1) = 0)
= P((x+1) r(x) h_1(a_1) = 0)
\geq P(h_1(a_1) = 0 \lor h_1(a_1) = x^{L-1} + x^{L-2} + \cdots + x + 1) = 1/2^{L-1}$.
Therefore, CYCLIC is not uniform for n even.

To show CYCLIC is never pairwise independent, consider n = 3 (for simplicity),
then $P(h(a_1, a_1, a_2) = h(a_1, a_2, a_1)) = P((x+1)(h_1(a_1) + h_1(a_2)) = 0)
\geq P(h_1(a_1) + h_1(a_2) = 0 \lor h_1(a_1) + h_1(a_2) = x^{L-1} + x^{L-2} + \cdots + x + 1) = 1/2^{L-1}$,
but 2-universal hash values are equal with probability at most $1/2^L$. The result is shown.

Table 3: CYCLIC hash for various values of $h_1(a)$ ($h(a,a) = x h_1(a) + h_1(a) \bmod 2^L + 1$)

h_1(a)   h(a,a)   h(a,a)             h(a,a)            h(a,a)
                  (first two bits)   (last two bits)   (first and last bit)
000      000      00                 00                00
100      110      11                 10                10
010      011      01                 11                01
110      101      10                 01                11
001      101      10                 01                11
101      011      01                 11                01
011      110      11                 10                10
111      000      00                 00                00

Of the four recursive hashing functions investigated by Cohen [17], GENERAL and
CYCLIC were superior both in terms of speed and uniformity, though CYCLIC had a
small edge over GENERAL. For n large, the benefits of these recursive hash functions
compared to the 3-wise independent hash function presented earlier can be substantial:
n table look-ups are much more expensive than a single look-up followed by binary
shifts.

10. CYCLIC is pairwise independent if you remove n-1 consecutive bits

Because Cohen found empirically that CYCLIC had good uniformity [17], it is
reasonable to expect CYCLIC to be almost uniform and maybe even almost pairwise
independent. To illustrate this intuition, consider Table 3 which shows that while h(a,a)
is not uniform (h(a,a) = 001 is impossible), h(a,a) minus any bit is indeed uniformly
distributed. We will prove that this result holds in general.
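
The content of Table 3 can be recomputed with a few lines of Python. This is a sketch for L = 3 and n = 2 with our own helper names; bits are indexed from the least significant end of the integer, so the printed bit order may differ from the table, but the distribution claims do not depend on that convention.

```python
from collections import Counter

L = 3
MASK = (1 << L) - 1


def rotl1(v):
    """Rotate the L-bit integer v left by one bit."""
    return ((v << 1) | (v >> (L - 1))) & MASK


def drop_bit(v, i):
    """Remove bit i of v and close the gap."""
    low = v & ((1 << i) - 1)
    high = v >> (i + 1)
    return (high << i) | low


# h(a, a) = x * h1(a) + h1(a) for every possible value of h1(a).
values = [rotl1(v) ^ v for v in range(1 << L)]
full = Counter(values)
reduced = [Counter(drop_bit(h, i) for h in values) for i in range(L)]
```

Only four of the eight 3-bit values ever occur in `full`, yet each counter in `reduced` assigns the same count to all four 2-bit values, in agreement with the table.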

The next lemma and the next theorem show that CYCLIC is quasi-pairwise independent
in the sense that $L-n+1$ consecutive bits (e.g., the first or last $L-n+1$
bits) are pairwise independent. In other words, CYCLIC is pairwise independent if
we are willing to sacrifice $n-1$ bits. (We say that n bits are "consecutive modulo
L" if the bits are located at indexes i mod L for n consecutive values of i such as
$i = k, k+1, \ldots, k+n-1$.)

Lemma 4 If $q(x) \in$ GF(2)[x]/($x^L + 1$) (with $q(x) \neq 0$) has degree n < L, then

• the equation $q(x) w = y \bmod x^L + 1$ modulo the first n bits has exactly $2^n$
  solutions for all y;

• more generally, the equation $q(x) w = y \bmod x^L + 1$ modulo any consecutive
  n bits (modulo L) has exactly $2^n$ solutions for all y.

By "equality modulo (some specified set of bit positions)", we mean that the two
quantities are bitwise identical, with exceptions permitted only at the specified
positions. For our polynomials, "equality modulo the first n bit positions" implies the
difference of the two polynomials has degree at most $n-1$.



Proof Let $\mathcal{P}$ be the set of polynomials of degree at most $L-n-1$. Take any
$p(x) \in \mathcal{P}$, then $q(x) p(x)$ has degree at most $L-n-1+n = L-1$ and thus if
$q(x) \neq 0$ and $p(x) \neq 0$, then $q(x) p(x) \neq 0 \bmod x^L + 1$. Hence, for any distinct
$p_1, p_2 \in \mathcal{P}$ we have $q(x) p_1 \neq q(x) p_2 \bmod x^L + 1$.

To prove the first item, we begin by showing that there is always exactly one
solution in $\mathcal{P}$. Consider that there are $2^{L-n}$ polynomials p(x) in $\mathcal{P}$, and
that all values $q(x) p(x)$ are distinct. Suppose there are $p_1, p_2 \in \mathcal{P}$ such that
$q(x) p_1 = q(x) p_2 \bmod x^L + 1$ modulo the first n bits, then $q(x)(p_1 - p_2)$ is a
polynomial of degree at most $n-1$ while $p_1 - p_2$ is a polynomial of degree at most
$L-n-1$ and q(x) is a polynomial of degree n, thus $p_1 - p_2 = 0$. (If $p_1 - p_2 \neq 0$ then
degree($q(x)(p_1 - p_2) \bmod x^L + 1$) $\geq$ degree(q(x)) = n, a contradiction.) Hence, all
p(x) in $\mathcal{P}$ are mapped to distinct values modulo the first n bits, and since there
are $2^{L-n}$ such distinct values, the result is shown.

Any polynomial of degree at most $L-1$ can be decomposed into the form
$p(x) + x^{L-n} z(x)$ where z(x) is a polynomial of degree at most $n-1$ and
$p(x) \in \mathcal{P}$. By the preceding result, for distinct $p_1, p_2 \in \mathcal{P}$,
$q(x)(x^{L-n} z(x) + p_1)$ and $q(x)(x^{L-n} z(x) + p_2)$ must be distinct modulo the first n
bits. In other words, the equation $q(x)(x^{L-n} z(x) + p) = y$ modulo the first n bits has
exactly one solution $p \in \mathcal{P}$ for any z(x) and since there are $2^n$ polynomials z(x)
of degree at most $n-1$, then $q(x) w = y$ (modulo the first n bits) must have $2^n$
solutions.

To prove the second item, choose j and use the first item to find any w solving
$q(x) w = y x^j \bmod x^L + 1$ modulo the first n bits. Then $w x^{L-j}$ is a solution to
$q(x) w = y \bmod x^L + 1$ modulo the bits in positions $j, j+1, \ldots, j+n-1 \bmod L$.

We have the following corollary to Lemma 4.

Corollary 1 If w is chosen uniformly at random in GF(2)[x]/($x^L + 1$), then
$P(q(x) w = y \bmod n-1 \text{ bits}) = 1/2^{L-n+1}$ where the $n-1$ bits are consecutive
(modulo L).

Theorem 1 Consider the L-bit CYCLIC n-gram hash family. Pick any $n-1$ consecutive
bit locations, then remove these bits from all hash values. The resulting
$(L-n+1)$-bit hash family is pairwise independent.

Proof We show
$P(q_1(x) h_1(a_1) + q_2(x) h_1(a_2) + \cdots + q_n(x) h_1(a_n) = y \bmod n-1 \text{ bits}) = 1/2^{L-n+1}$
for any polynomials $q_i$ where at least one is different from zero. It is true when
there is a single non-zero polynomial $q_i(x)$ by Corollary 1. Suppose it is true up to
$k-1$ non-zero polynomials and consider a case where we have k non-zero polynomials.
Assume without loss of generality that $q_1(x) \neq 0$, we have
$P(q_1(x) h_1(a_1) + q_2(x) h_1(a_2) + \cdots + q_n(x) h_1(a_n) = y \bmod n-1 \text{ bits})
= P(q_1(x) h_1(a_1) = y - q_2(x) h_1(a_2) - \cdots - q_n(x) h_1(a_n) \bmod n-1 \text{ bits})
= \sum_{y'} P(q_1(x) h_1(a_1) = y - y' \bmod n-1 \text{ bits})
  P(q_2(x) h_1(a_2) + \cdots + q_n(x) h_1(a_n) = y' \bmod n-1 \text{ bits})
= \sum_{y'} \frac{1}{2^{L-n+1}} \frac{1}{2^{L-n+1}} = 1/2^{L-n+1}$
by the induction argument, where the sum is over $2^{L-n+1}$ values of y'. Hence the
uniformity result is shown.

Consider two distinct sequences $a_1, a_2, \ldots, a_n$ and $a'_1, a'_2, \ldots, a'_n$. Write
$H_a = h(a_1, a_2, \ldots, a_n)$ and $H_{a'} = h(a'_1, a'_2, \ldots, a'_n)$. To prove pairwise
independence, it suffices to show that
$P(H_a = y \bmod n-1 \text{ bits} \mid H_{a'} = y' \bmod n-1 \text{ bits}) = 1/2^{L-n+1}$. Suppose that
$a_i = a'_j$ for some i, j; if not, the result follows by the (full) independence
of the hashing function $h_1$. Using Lemma 4, find q(x) such that
$q(x) \sum_{k \mid a'_k = a'_j} x^{n-k} = -\sum_{k \mid a_k = a_i} x^{n-k} \bmod n-1 \text{ bits}$, then
$H_a + q(x) H_{a'} \bmod n-1$ bits is independent from $a_i = a'_j$ (and $h_1(a_i) = h_1(a'_j)$).

The hashed values $h_1(a_k)$ for $a_k \neq a_i$ and $h_1(a'_k)$ for $a'_k \neq a'_j$ are now
relabelled as $h_1(b_1), \ldots, h_1(b_m)$. Write
$H_a + q(x) H_{a'} = \sum_k q_k(x) h_1(b_k) \bmod n-1$ bits where $q_k(x)$ are polynomials in
GF(2)[x]/($x^L + 1$) (not all $q_k(x)$ are zero). As in the proof of Lemma 1, we have that
$H_{a'} = y' \bmod n-1$ bits and $H_a + q(x) H_{a'} = y + q(x) y' \bmod n-1$ bits are independent:
$P(H_{a'} = y' \bmod n-1 \text{ bits} \mid y', b_1, b_2, \ldots, b_m) = 1/2^{L-n+1}$ by Corollary 1
since $H_{a'} = y'$ can be written as $r(x) h_1(a'_j) = y' - \sum_k r_k(x) h_1(b_k)$ for
some polynomials $r(x), r_1(x), \ldots, r_m(x)$. Hence, we have

\[
\begin{aligned}
& P(H_a = y \bmod n-1 \text{ bits} \mid H_{a'} = y' \bmod n-1 \text{ bits}) \\
&= P(H_a + q(x) H_{a'} = y + q(x) y' \bmod n-1 \text{ bits} \mid H_{a'} = y' \bmod n-1 \text{ bits}) \\
&= P(H_a + q(x) H_{a'} = y + q(x) y' \bmod n-1 \text{ bits}) \\
&= P\Big(\sum_k q_k(x) h_1(b_k) = y + q(x) y' \bmod n-1 \text{ bits}\Big)
\end{aligned}
\]

and by the earlier uniformity result, this last probability is equal to $1/2^{L-n+1}$.
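
Theorem 1 can be checked exhaustively on a toy instance. The sketch below is our own construction with L = 3, n = 2 and the two-letter alphabet {a, b}: it enumerates all 64 choices of ($h_1(a)$, $h_1(b)$) and tabulates the joint distribution of the hash values of every pair of distinct 2-grams, with and without removing $n-1 = 1$ bit.

```python
from collections import Counter
from itertools import combinations, product

L, n = 3, 2
MASK = (1 << L) - 1


def rotl1(v):
    return ((v << 1) | (v >> (L - 1))) & MASK


def cyclic(ngram, h1):
    x = 0
    for c in ngram:
        x = rotl1(x) ^ h1[c]
    return x


def drop_bit(v, i):
    """Remove bit i of v and close the gap."""
    return ((v >> (i + 1)) << i) | (v & ((1 << i) - 1))


ngrams = ["".join(w) for w in product("ab", repeat=n)]


def joint_counts(w1, w2, i=None):
    """Count outcomes of (h(w1), h(w2)) over every choice of h1(a), h1(b).

    Bit i is removed from both hash values unless i is None.
    """
    counts = Counter()
    for u, v in product(range(1 << L), repeat=2):
        h1 = {"a": u, "b": v}
        y1, y2 = cyclic(w1, h1), cyclic(w2, h1)
        if i is not None:
            y1, y2 = drop_bit(y1, i), drop_bit(y2, i)
        counts[(y1, y2)] += 1
    return counts


# After dropping one bit: all 16 pairs of 2-bit values occur equally often.
pairwise_after_drop = all(
    len(c) == 16 and set(c.values()) == {4}
    for w1, w2 in combinations(ngrams, 2)
    for i in range(L)
    for c in [joint_counts(w1, w2, i)]
)
# Without dropping a bit: the 64 pairs of 3-bit values are never all reached.
pairwise_full = all(
    len(joint_counts(w1, w2)) == 64 for w1, w2 in combinations(ngrams, 2)
)
```

On this instance `pairwise_after_drop` is true and `pairwise_full` is false, which is what Lemma 3 and Theorem 1 predict; an exhaustive check of this kind only illustrates the statement and is no substitute for the proof.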

11. Experimental comparison

Irrespective of p(x), computing hash values has complexity $\Omega(L)$. For GENERAL
and CYCLIC, we require L > n. Hence, the computation of their hash values is in $\Omega(n)$.
For moderate values of L and n, this analysis is pessimistic because CPUs can process
32- or 64-bit words in one operation.

To assess their real-world performance, the various hashing algorithms were written
in C++. We compiled them with the GNU GCC 4.0.1 compiler on an Apple MacBook
with two Intel Core 2 Duo processors (2.4 GHz) and 4 GiB of RAM. The -O3
compiler flag was used since it provided slightly better performance for all algorithms.
All hash values are stored using 32-bit integers, irrespective of the number of bits used.

All hashing functions generate 19-bit hash values, except for CYCLIC which generates
(19+n)-bit hash values. We had CYCLIC generate more bits to compensate for the
fact that it is only pairwise independent after removal of $n-1$ consecutive bits. For
GENERAL, we used a degree-19 irreducible polynomial p(x) taken from [25]. For
Randomized Karp-Rabin, we used the ID37 family. The character hash values are stored
in an array for fast look-up.



Footnote: We use the shorthand notation $P(f(x,y) = c \mid x, y) = b$ to mean
$P(f(x,y) = c \mid x = z_1, y = z_2) = b$ for all values of $z_1, z_2$.

Footnote: http://code.google.com/p/ngramhashing/

Figure 1: Wall-clock running time to hash all n-grams in the King James Bible

We report wall-clock time in Fig. 1 for hashing the n-grams of the King James
Bible [20], which contains 4.3 million ASCII characters. CYCLIC is twice as fast as
GENERAL. As expected, the running time of the non-recursive hash function (3-wise)
grows linearly with n: for n = 5, 3-wise is already seven times slower than CYCLIC.
Speed-wise, Randomized Karp-Rabin (ID37) is the clear winner, being nearly twice as
fast as CYCLIC. The performance of CYCLIC and ID37 is oblivious to n in this test.

The RAM-Buffered GENERAL timings are, as expected, independent of n, but
they are twice as large as the CYCLIC timings. We do not show the modified version
of RAM-Buffered GENERAL that uses two precomputed arrays instead of a single one.
It was approximately 30% slower than ordinary RAM-Buffered GENERAL, even up to
n = 25. However, its RAM usage was 3 orders of magnitude smaller: from 135 MB
down to 25 kB. Overall, we cannot recommend RAM-Buffered GENERAL or its
modification considering that (1) its memory usage grows as $2^n$ and (2) it is slower than
CYCLIC.

12. Conclusion

Considering speed and pairwise independence, we recommend CYCLIC, after
discarding $n-1$ consecutive bits. If we require only uniformity, Randomized
Integer-Division is twice as fast.